Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized deletion for Trie/TrieMap #525

Merged
merged 24 commits into from
Feb 28, 2023
Merged

Optimized deletion for Trie/TrieMap #525

merged 24 commits into from
Feb 28, 2023

Conversation

luc-blaeser
Copy link
Contributor

@luc-blaeser luc-blaeser commented Feb 14, 2023

Update: The current AssocList.replace() with null is indeed deleting.

This PR only optimizes the Trie and TrieMap delete functionality that it can free more memory.
Leaves (but not the branches) are recursively collapsed along the delete path in the trie.

  • Unit testing: Added randomized tests for TrieMap, implicitly also testing Trie.
  • GC benchmark: Shows that more heap space can be reclaimed than with the current implementation.

src/Trie.mo Outdated Show resolved Hide resolved
@luc-blaeser luc-blaeser self-assigned this Feb 14, 2023
Extracting a helper function that does not access the closure.
src/Trie.mo Outdated Show resolved Hide resolved
Co-authored-by: Claudio Russo <[email protected]>
src/AssocList.mo Outdated Show resolved Hide resolved
src/Trie.mo Outdated
public func remove<K, V>(trie : Trie<K, V>, key : Key<K>, equal : (K, K) -> Bool) : (Trie<K, V>, ?V) {
func rec(trie : Trie<K, V>, bitPosition : Nat) : (Trie<K, V>, ?V) {
switch trie {
case (#empty) { (#empty, null) };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks plausible to me, but in
#427 (comment)
Joachim warns against a broken path compression optimization in filter or map. Does that warning apply here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. The path cannot be compressed for branch nodes because the bit positions matter (as Joachim correctly mentioned). This is why I only collapse the leaf nodes (with an empty sibling or another sibling leaf). Or do I miss something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. I wonder if you can delete a branch if both trees are empty. Can you wind up with nested branches that are actually empty at their leaves or does the recursion eliminate that coming back up the tree?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added collapsing two empty leaves (since leaf() function also returns an empty node for an empty leaf).
After deleting all elements in a 1000_000 trie, the trie is again empty (no branches contained any longer).

Copy link
Contributor

@crusso crusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but would be nice if @matthewhammer could take a look.

@crusso crusso self-requested a review February 14, 2023 14:50
@matthewhammer
Copy link
Contributor

(The existing remove logic only stored a logical delete flag, null, in the cleared entry and therefore was leaking memory.)

@luc-blaeser I do not follow this claim.

The existing logic, as with the new logic proposed here, uses association lists. Those lists then must do the delete physically for the Trie or TrieMap to have this property.

Looking at the existing replace logic (which subsumes both insert and remove and implements both with a more general type), I see that it does in fact "shrink" the list when it finds the target element of the remove (again, phrased as a replace):

        case (?((hd_k, hd_v), tl)) {
          if (equal(key, hd_k)) {
            // if value is null, remove the key; otherwise, replace key's old value
            // return old value
            switch value {
              case (null) { (tl, ?hd_v) };
              case (?value) { (?((hd_k, value), tl), ?hd_v) }
            }
          } else {
          ...
          }

So we see that in the subcase for null, the copied list really is shrinking.

I realize that this PR is doing some orthogonal improvements around this operation, but I remain very confused by the claim that the existing logic does not actually release memory (after GC over a TrieMap that uses this logic, it should, right)?

As replace with null is deleting in AssocList
@luc-blaeser
Copy link
Contributor Author

(The existing remove logic only stored a logical delete flag, null, in the cleared entry and therefore was leaking memory.)

@luc-blaeser I do not follow this claim.

The existing logic, as with the new logic proposed here, uses association lists. Those lists then must do the delete physically for the Trie or TrieMap to have this property.

Looking at the existing replace logic (which subsumes both insert and remove and implements both with a more general type), I see that it does in fact "shrink" the list when it finds the target element of the remove (again, phrased as a replace):

        case (?((hd_k, hd_v), tl)) {
          if (equal(key, hd_k)) {
            // if value is null, remove the key; otherwise, replace key's old value
            // return old value
            switch value {
              case (null) { (tl, ?hd_v) };
              case (?value) { (?((hd_k, value), tl), ?hd_v) }
            }
          } else {
          ...
          }

So we see that in the subcase for null, the copied list really is shrinking.

I realize that this PR is doing some orthogonal improvements around this operation, but I remain very confused by the claim that the existing logic does not actually release memory (after GC over a TrieMap that uses this logic, it should, right)?

Thanks Matthew for pointing this out. You are right, I found a mistake in my benchmark case (deleting less than allocating). I am very sorry for this confusion.

So, after re-measuring (hopefully correct this time), the benchmark shows that a majority of the memory can be reclaimed after deleting the entries in the TrieMap. Only the (empty) leaves in the trie remain after delete. This PR only adds a small optimization insofar that it also combines and frees some of the leaf nodes.

Measurement results, creating 1_000_000 (Nat, Nat) entries in a TrieMap, then removing all of them (several rounds):

Heap size

After allocation After deletion (current code) After deletion (with this PR)
75MB 11 MB 2MB

The question is on whether we should consider this PR or close it for the moment.

@matthewhammer
Copy link
Contributor

matthewhammer commented Feb 14, 2023

@luc-blaeser Okay, I am relieved. No worries!

Before you mentioned finding an actual measurement-related bug, I started wondering if the Trie is just so wasteful (rebuilding those little lists in a pure way) that it seemed like there was a leak, as you alluded to today in the meeting.

FWIW, I apologize for the Trie being cryptic in its implementation. When I started rereading that code snippet this morning I started doubting if I even understood it, then I did again, eventually. :)

This PR only adds a small optimization insofar that it also combines and frees some of the leaf nodes.

In addition to the space trimming for delete, the PR also adds more unit tests that are randomized.

Both of those improvements seem valuable to me!

BTW, I didn't consider these cases for compressing subtrees when I was "fixing" the earlier, broken way that filter had done it.

Now, filter uses this logic to compress its output tree:

          if (isEmpty(fl) and isEmpty(fr)) {
            #empty
          } else {
            branch(fl, fr)
          }

But it's missing some compression opportunities when one subtree is a leaf and the other is null.

Your logic in the PR could expand to improve filter too, it seems?

@luc-blaeser luc-blaeser changed the title Physical deletion for Trie/TrieMap Optimized deletion for Trie/TrieMap Feb 14, 2023
@crusso
Copy link
Contributor

crusso commented Feb 15, 2023

The space savings (and more tests) still look worthwhile to me, so I'd suggest continuing with the PR or merging in its current state (perhaps with TODO to improve filter and mapFilter (if we have that)..

Is it possible to get to the situation where removing all the elements from a tree gets you back to #empty or is that just too expensive to arrange? I'm just curious, not necessarily suggesting we reach for that goal.

@luc-blaeser
Copy link
Contributor Author

luc-blaeser commented Feb 15, 2023

The space savings (and more tests) still look worthwhile to me, so I'd suggest continuing with the PR or merging in its current state (perhaps with TODO to improve filter and mapFilter (if we have that)..

Is it possible to get to the situation where removing all the elements from a tree gets you back to #empty or is that just too expensive to arrange? I'm just curious, not necessarily suggesting we reach for that goal.

I tested it, and the tree is actually empty again after the removal of all entries. I will also look at the filter and mapFilter soon.

Update: The empty node collapsing logic has been applied to filter and mapFilter too.

/// trie := Trie.put(trie, key "test", Text.equal, 1).0;
/// trie := Trie.replace(trie, key "test", Text.equal, 42).0;
/// assert (Trie.get(trie, key "hello", Text.equal) == ?42);
/// ```
public func replace<K, V>(t : Trie<K, V>, k : Key<K>, k_eq : (K, K) -> Bool, v : ?V) : (Trie<K, V>, ?V) {
Copy link
Contributor

@crusso crusso Feb 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this would produce a little less garbage by communicating the previous value using private state and having rec simply return a trie, not a pair, as in RedBlackTree.insert/replace (or what ever it is called). Maybe not worth it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. I implemented this.

src/Trie.mo Outdated Show resolved Hide resolved
src/Trie.mo Outdated Show resolved Hide resolved
src/Trie.mo Outdated Show resolved Hide resolved
src/Trie.mo Show resolved Hide resolved
src/Trie.mo Show resolved Hide resolved
src/Trie.mo Outdated Show resolved Hide resolved
src/Trie.mo Show resolved Hide resolved
Copy link
Contributor

@crusso crusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Just made some minor suggestions to remove parens and perhaps optimize replace, but all optional.

There's probably similar spurious parens in patterns in the rest of the file, but I can't make GH suggestions on unchanged lines. Probably not worth fixing.

@luc-blaeser
Copy link
Contributor Author

Nice. Just made some minor suggestions to remove parens and perhaps optimize replace, but all optional.

There's probably similar spurious parens in patterns in the rest of the file, but I can't make GH suggestions on unchanged lines. Probably not worth fixing.

Thanks a lot for all these valuable suggestions. Would you like to push the parentheses optimization?

src/Trie.mo Outdated Show resolved Hide resolved
Copy link
Contributor

@crusso crusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM apart from the redundant assignment.

@luc-blaeser luc-blaeser merged commit f713607 into master Feb 28, 2023
@luc-blaeser luc-blaeser deleted the luc/trie-delete branch February 28, 2023 13:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants